Authors:
We are immensely grateful for the collaboration and support of zeb Consulting, which has greatly contributed to the success of this project.
This project aims to develop a machine learning model to predict the leasing prices of vehicles based on various attributes.
In the current macroeconomic environment, accurately forecasting leasing asset values and pricing is crucial for leasing banks. Additionally, the automotive market has experienced significant price fluctuations and supply chain disruptions, further emphasizing the need for reliable predictions. Leveraging a dataset provided by a leasing bank, our research and development efforts focus on building the most accurate prediction models.
The final outcome will be a graphical user interface (GUI) that allows users to input vehicle details and obtain leasing rate predictions. By employing state-of-the-art machine learning techniques and addressing challenges such as data quality and model selection, this project aims to provide an effective tool for leasing banks in assessing asset values and making informed pricing decisions.
To develop high-quality machine learning algorithms and streamline data processing, we utilized state-of-the-art libraries such as pandas, numpy, and sklearn in this project. These libraries enabled us to implement advanced techniques and achieve efficient data manipulation and analysis. If you run into any problems with the imported libraries, please refer to the Debugging library versions section.
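When debugging library issues, a first step is usually to check which versions are installed. A minimal sketch for such a check (the exact versions used in this project are not stated here):

```python
# Print the versions of the core libraries used in this notebook.
import pandas as pd
import numpy as np
import sklearn

for name, module in [("pandas", pd), ("numpy", np), ("scikit-learn", sklearn)]:
    print(f"{name}: {module.__version__}")
```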
To optimize the computational and time requirements of building machine learning models, we have introduced a section that allows you to choose between computing new models or importing existing ones. Additionally, you will be prompted to evaluate your machine's performance, ranging from "Ludicrous" (highest performance) to "Low" (lowest performance). Your response will determine the number of iterations and cross-validations in the subsequent Random Search process.
Moreover, the number of threads used for model building is limited to your available threads minus two, ensuring that your machine remains usable during the process. This approach aims to strike a balance between model generation and system usability.
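The thread cap described above could be sketched as follows; the helper name is hypothetical and not taken from the notebook:

```python
import os

# Hypothetical helper: use all available threads minus a reserve of two,
# so the machine stays responsive during model building, but never fewer
# than one thread.
def usable_threads(reserve: int = 2, minimum: int = 1) -> int:
    available = os.cpu_count() or minimum
    return max(available - reserve, minimum)

n_jobs = usable_threads()
```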
To import the data, please specify your data folder in the following cell.
If you want to import your models, please specify the folder from which they shall be imported in the following cell.
If you want to use a different or new dataset, please ensure that you correctly assign the column names of your dataset in the following dictionary.
This ensures that the models can be rebuilt on a different dataset with minimal changes.
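Such a column-name dictionary might look like the sketch below; the keys are the names used internally, while the values (here identical) would be replaced with the column names of your own dataset. All names are assumptions based on the features discussed in this notebook:

```python
# Hypothetical mapping: internal name -> column name in your dataset.
COLUMN_MAPPING = {
    "monthly_fee": "monthly_fee",
    "mileage": "mileage",
    "first_registration": "first_registration",
    "duration": "duration",
    "horsepower": "horsepower",
    "kilowatts": "kilowatts",
    "emission_value": "emission_value",
    "consumption": "consumption",
    "brand": "brand",
    "model": "model",
    "fuel": "fuel",
    "gear": "gear",
}

def rename_to_internal(df, mapping=COLUMN_MAPPING):
    """Rename the dataset's columns to the names expected by the pipelines."""
    return df.rename(columns={external: internal
                              for internal, external in mapping.items()})
```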
In order to ensure data usability, certain features require formatting adjustments. These adjustments involve removing units, replacing commas with decimal points, and calculating the age based on registration information. These transformations are performed within the "basic preprocessing pipeline" to prepare the data for further analysis and modeling.
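The transformations above might look like the following sketch; the column names, unit formats, and the reference year are assumptions, not the notebook's actual values:

```python
import pandas as pd

def basic_preprocessing(df: pd.DataFrame, reference_year: int = 2023) -> pd.DataFrame:
    """Sketch of the unit-stripping and age-calculation steps described above."""
    out = df.copy()
    # Strip units such as " km", drop thousands separators, and convert
    # German decimal commas to points.
    out["mileage"] = (
        out["mileage"].str.replace(" km", "", regex=False)
                      .str.replace(".", "", regex=False)
                      .str.replace(",", ".", regex=False)
                      .astype(float)
    )
    # Derive the vehicle age from the first-registration date.
    out["age"] = reference_year - pd.to_datetime(
        out["first_registration"], format="%m/%Y").dt.year
    return out
```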
The shape of our basic preprocessing pipeline looks like:
This section contains the visualization of the dataset, which serves as an essential step in understanding the underlying patterns and relationships within the data. This exploratory analysis aims to provide insights into the dataset's characteristics and unveil meaningful trends that can guide subsequent modeling efforts.
In this section, we will present an overview of the target variable in this analysis.
Specifically, the target variable in our predictor refers to the monthly leasing price associated with a leasing asset.
This histogram shows the monthly fee distribution for the cars in the dataset. The distribution exhibits a right-skewed pattern, indicating that a majority of the leasing rates fall towards the lower end of the scale, with minimum leasing rates starting from approximately 250 euros. The peak of the distribution occurs around 350 euros, with a significant number of cars falling within this monthly fee range. Most of the values lie between 300 and 900 euros. Interestingly, very few cars have a leasing rate between 2000 and 2500 euros. These cars are not outliers but simply very expensive cars.
The boxplot for the monthly fee distribution verifies what we observed from the histogram. The minimum monthly fee value is approximately 250 euros, while the maximum reaches up to 1200 euros. The median, which represents the middle value of the distribution, lies slightly above 500 euros. Upon closer examination, it becomes apparent that the boxplot exhibits numerous outliers. However (as stated earlier), these outliers are not indicative of data anomalies but rather represent the presence of very expensive cars with exceptionally high leasing rates. This observation highlights the diversity within the dataset, as it encompasses both affordable and luxury vehicles.
The numerical features gathered are:
Scatterplots provide a visual representation to explore the relationship between numerical variables and the target variable in our dataset. In our analysis, we plotted our target variable on the y-axis against all the numeric variables on the x-axis.
From the scatterplots, several key findings emerged. First, a strong linear relationship was observed between the monthly fee and the horsepower. As the horsepower increases, the monthly fee tends to rise as well, indicating a positive correlation between these variables.
Additionally, we discovered that lower mileage and more recent initial registrations are associated with higher monthly fees. This relationship is evident from the concentrated distribution of points in the corresponding regions of the scatterplots, suggesting that these factors have a noticeable impact on leasing rates.
Examining the scatterplots for consumption and duration, we noticed an interesting pattern. Despite variations in duration and consumption, the monthly fees are relatively evenly distributed. This implies that these variables may not exert a significant influence on the monthly fee, as indicated by the consistent spread of points across different durations and consumption levels.
Regarding the emission variable, we observed a dense cluster of points in the middle range, signifying a large number of vehicles with emission values between approx. 100-250. This indicates that both low-cost and high-cost vehicles exist within this emission range, potentially reflecting a diverse market segment with various pricing factors beyond emissions alone.
Overall, these insights from the scatterplots shed light on the relationships between the numerical variables and the monthly fee, providing valuable information for feature selection and understanding the factors influencing leasing rates.
When examining the skewedness of the numerical variables in our dataset, interesting patterns emerge. Specifically, the registration and mileage variables exhibit right-skewed distributions, indicating a high concentration of new vehicles. This suggests that a significant portion of the dataset comprises recently registered vehicles with relatively low mileage.
In contrast, the emission and consumption variables demonstrate symmetric distributions, implying a more balanced distribution of values. The absence of skewness in these variables suggests that the dataset encompasses a diverse range of emission and consumption values without a pronounced bias towards higher or lower values.
The categorical features gathered are:
The barplots provide insights into the fuel types and gear types of the vehicles in our dataset. We can observe that the dataset comprises three primary fuel types: diesel, gasoline, and hybrid vehicles. Additionally, the barplots reveal the presence of two gear types: automatic and manual shifts. This highlights the transmission options available within the dataset.
Upon exploring the brands present in our dataset, we identified several notable findings. The most frequently occurring brands among the vehicles in our dataset are Seat, Volkswagen, Audi, Skoda, Cupra, BMW, and Opel.
The inclusion of these brands in our dataset represents a mix of mainstream vehicles, providing a comprehensive view of the market and enabling us to analyze the impact of brand types on the monthly leasing fees.
This section provides an overview of how the categorical features influence the target variable, which is the monthly fee.
Analyzing the relationship between the target variable and the categorical variables in our dataset revealed interesting insights. Plotting the target variable against different categorical variables allowed us to examine their impact on leasing rates.
From the visualizations, we observed that BMW models, Citroën, and Audi tend to have higher average leasing prices compared to other brands.
Moreover, when considering the fuel type, we found that hybrid vehicles tend to have higher monthly fees compared to gasoline or diesel vehicles. This is attributed to the increased cost of hybrid technology and the potential for fuel savings over time.
Additionally, we observed that automatic cars are generally more expensive to lease compared to cars with manual shifting. This is very likely due to the perceived convenience and comfort associated with automatic transmissions, which can contribute to higher demand and pricing.
In this section, we present a plot showcasing the correlations between the features in the dataset. By visualizing the correlations, we gain insights into the relationships and dependencies among the different attributes. This correlation plot provides a comprehensive overview of the interplay between the numerical and categorical features, highlighting which variables are positively or negatively correlated. Understanding these correlations helps us identify potential patterns and dependencies that can significantly impact the target variable, providing valuable insights for subsequent analysis and model development.
This correlation matrix measures the relationships between different variables related to a car, namely: mileage, first_registration, duration, monthly_fee, emission_value, consumption, horsepower, and kilowatts. The correlations range from -1 (perfect negative correlation, as one variable increases, the other decreases) through 0 (no correlation, the variables do not move together) to +1 (perfect positive correlation, the variables increase and decrease together).
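A matrix like this can be computed with pandas; a minimal sketch, with the column list passed in explicitly:

```python
import pandas as pd

def correlation_matrix(df: pd.DataFrame, cols) -> pd.DataFrame:
    """Pairwise Pearson correlations of the given numerical columns."""
    return df[cols].corr(method="pearson")
```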
Monthly_fee and Horsepower/Kilowatts:
These pairs show very strong positive correlations (0.827053 and 0.826905, respectively). This suggests that cars with higher horsepower or kilowatts are associated with higher monthly fees. This could mean that more powerful vehicles tend to have higher monthly costs, potentially due to reasons such as higher insurance premiums, increased fuel consumption, or greater maintenance requirements.
Monthly_fee and Duration:
There is a moderate negative correlation (-0.280965) between these variables, indicating that the monthly fee decreases as the leasing duration increases. This could be because the costs associated with a car (like loan payments or certain insurance costs) often decrease over time.
Monthly_fee and Mileage/First_registration:
The correlation between monthly_fee and these two variables is relatively weak (-0.060930 and -0.041417, respectively). This suggests that the monthly fee does not change significantly with changes in the mileage of the car or its first registration date. It's worth noting though, that in some cases, older cars (with an earlier first registration date) or cars with higher mileage could potentially have higher maintenance costs which could affect the monthly fee.
Monthly_fee and Emission_value/Consumption:
The correlations here are also very weak (-0.008253 and -0.012807, respectively). These small negative correlations suggest that cars with higher emissions or consumption are associated with slightly lower monthly fees, although this relationship is very weak. This may be because cars with higher emissions or fuel consumption tend to be older models, which could have lower associated costs in some areas (like lower insurance or depreciated value).
Mileage and First_registration:
These two variables show a strong positive correlation of 0.845908, implying that as the age of the car (as suggested by the first registration) increases, so does the mileage it has run. This is a fairly intuitive relationship since older cars have typically been driven more.
Duration and First_registration:
These two variables have a moderate negative correlation of -0.459295, indicating that longer leasing durations tend to be associated with newer cars (i.e., cars with a later first registration date). This might imply that people tend to lease newer cars for longer periods.
Our analysis of the correlation matrix reveals two key variables that significantly influence the monthly fee, namely horsepower and leasing duration. Horsepower shows a strong positive correlation, suggesting that more powerful cars generally incur higher monthly fees. Conversely, the leasing duration has a negative correlation with the monthly fee, indicating that longer leasing contract durations result in lower monthly fees. Other variables, including mileage, first_registration, emission_value, and consumption, show a weaker correlation with the monthly_fee, suggesting a lesser direct impact on this target variable.
As expected, horsepower and kilowatts show a perfect correlation, because they are different units of the same attribute of a car, the engine power. To prevent multicollinearity, which can complicate interpretation of the model, we decided to exclude kilowatts from our subsequent models. This decision helps to streamline our model by eliminating redundant information, focusing on the most relevant predictors for the monthly fee.
As anticipated, kilowatts (kW) and horsepower (HP) are perfectly correlated, since they are related by the fixed linear conversion P[kW] = 0.7457 * P[HP].
Given this direct correlation, we decided to exclude the kilowatts feature from the dataset when constructing our models.
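As a small worked check of the conversion stated above:

```python
HP_TO_KW = 0.7457  # 1 (mechanical) horsepower = 0.7457 kilowatts

def hp_to_kw(horsepower: float) -> float:
    """Convert engine power from horsepower to kilowatts."""
    return horsepower * HP_TO_KW
```

Because one series is an exact scalar multiple of the other, their Pearson correlation is exactly 1, which is why keeping both adds no information.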
In this section, we delve into the preprocessing and feature engineering steps of our notebook. These crucial steps lay the foundation for preparing the data and optimizing its suitability for the machine learning process. We begin by splitting the data into out-of-sample, test, and train sets to ensure robust model evaluation and prevent overfitting.
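The three-way split described above could be sketched as a two-stage `train_test_split`; the split fractions here are illustrative assumptions, not the notebook's actual values:

```python
from sklearn.model_selection import train_test_split

def split_data(X, y, oos_size=0.1, test_size=0.2, random_state=42):
    """First hold out an out-of-sample set, then split the rest into
    train and test sets."""
    X_rest, X_oos, y_rest, y_oos = train_test_split(
        X, y, test_size=oos_size, random_state=random_state)
    X_train, X_test, y_train, y_test = train_test_split(
        X_rest, y_rest, test_size=test_size, random_state=random_state)
    return X_train, X_test, X_oos, y_train, y_test, y_oos
```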
Once the data is appropriately split, we proceed to define transformer pipelines using the renowned scikit-learn library. These pipelines facilitate systematic data transformations and feature engineering, ensuring consistency and efficiency throughout the machine learning process. By employing transformer pipelines, we can seamlessly apply various preprocessing techniques such as scaling, encoding categorical variables, handling missing values, and creating interaction features.
By carefully designing and implementing these preprocessing and feature engineering steps, we enhance the quality and representativeness of our data, enabling the machine learning models to capture meaningful patterns and make accurate predictions. The transformer pipelines in scikit-learn provide a flexible and comprehensive framework for streamlining these essential data preparation tasks, promoting reproducibility and scalability in our analysis.
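A preprocessor of the kind described above might be assembled like this; the column lists and the choice of imputation strategy are assumptions for illustration:

```python
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder, StandardScaler

numeric_features = ["mileage", "duration", "horsepower",
                    "emission_value", "consumption"]
categorical_features = ["brand", "model", "fuel", "gear"]

# Impute and scale numeric columns; ordinally encode categorical columns.
preprocessor = ColumnTransformer([
    ("num", Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ]), numeric_features),
    ("cat", OrdinalEncoder(handle_unknown="use_encoded_value",
                           unknown_value=-1), categorical_features),
])
```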
Only two features contain missing values: emission and consumption.
Missing values will be imputed in the transformer pipeline, using the standard imputer provided by the scikit-learn package.
Analyzing the cardinality helps us understand the number of distinct categories within each feature. In this case, the "Brand" feature consists of 20 unique brands, indicating a moderate level of variation. On the other hand, the "Model" feature has a higher cardinality with 346 unique models, suggesting a more diverse range of vehicle variations.
The "Gear" feature has only two categories, indicating a binary classification of the transmission type (e.g., manual vs. automatic). Similarly, the "Fuel" feature has three categories representing different fuel types (e.g., gasoline, diesel, hybrid).
Understanding the cardinality of categorical features is important for various aspects of data analysis, including feature selection, encoding strategies, and model interpretation. High cardinality may require careful handling to avoid overfitting or computational challenges, while low cardinality features can simplify modeling and analysis tasks.
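The cardinality figures above can be obtained directly with pandas, e.g.:

```python
import pandas as pd

def cardinality(df: pd.DataFrame, cols) -> pd.Series:
    """Number of distinct categories per categorical feature."""
    return df[cols].nunique()
```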
The cardinality of the model feature was a major concern in encoding categorical features. The high cardinality of the "model" feature led to high dimensionality of the dataframe and the models.
During the dataset splitting process using the "train_test_split" function, stratification is required so that every category appears in the training set and one-hot encoding works properly. However, a challenge arises with entries in the "model" column that appear only once or twice: stratification cannot be applied to these unique or minimally occurring entries.
To address this issue, we have identified three possible approaches. The first approach involves dropping the single or double occurrence entries of the "model" column. This reduces the complexity introduced by the limited occurrences during stratification. Alternatively, the second approach suggests duplicating or tripling the once or twice occurring entries to increase their representation in the dataset. This approach helps maintain balance and avoids losing potentially valuable information. Another option is to create a combined category for these specific models, treating them as a separate group during the splitting process. This approach can preserve the uniqueness of these entries while ensuring proper stratification.
For the current implementation, we have decided to drop the entries that appear only once or twice. However, we acknowledge that other approaches may be explored in the future to fully utilize the data from these unique or minimally occurring entries.
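The chosen approach, dropping entries that occur only once or twice, could be sketched as follows (function and parameter names are illustrative):

```python
import pandas as pd

def drop_rare_categories(df: pd.DataFrame, col: str = "model",
                         min_count: int = 3) -> pd.DataFrame:
    """Drop rows whose category in `col` appears fewer than `min_count`
    times, so that stratified splitting on that column is possible."""
    counts = df[col].value_counts()
    keep = counts[counts >= min_count].index
    return df[df[col].isin(keep)]
```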
This section was significant for the prior encoding method, OneHot encoding, but would be redundant now. However, to be able to use old models, which were built using OneHot encoding, we will keep this part.
In this section, we focus on the implementation of transformer pipelines, which play a vital role in the preprocessing and feature engineering stages of our machine learning workflow.
When it comes to choosing between OneHot Encoder, Label Encoder, and Ordinal Encoder, the decision depends on the specific requirements of the machine learning models being used. Let's evaluate the applicability of these encoders for different models:
Decision Tree and Random Forest:
Decision trees and random forests can handle both categorical and numerical features effectively. They are not influenced by the encoding technique used, making them compatible with all three encoders. OneHot Encoder is suitable for decision trees and random forests as it can represent categorical variables without imposing an ordinal relationship. Label Encoder and Ordinal Encoder can also be used, but they might introduce an implicit order that may or may not be appropriate for the model.
XGBoost and AdaBoost:
XGBoost and AdaBoost are ensemble learning methods based on boosting. Similar to decision trees and random forests, they can handle both categorical and numerical features. OneHot Encoder, Label Encoder, and Ordinal Encoder can be used with both models. OneHot Encoder may result in high dimensionality, but XGBoost's ability to handle sparse data makes it feasible. However, considering the potential memory and computational limitations, careful consideration should be given to the choice of encoding.
KNN (K-Nearest Neighbors):
KNN is a distance-based algorithm that calculates similarity between data points. OneHot Encoder is not suitable for KNN as it can lead to the curse of dimensionality due to the high dimensionality introduced. Label Encoder and Ordinal Encoder can be used for KNN, but they assume an underlying order or ranking that may not be appropriate for categorical features. Thus, it is advisable to use alternative encoding techniques such as target-based encoding or frequency encoding to preserve the categorical information while mitigating the dimensionality issue.
SVR (Support Vector Regression):
SVR is a regression method that uses support vectors to find the best fit. Similar to KNN, OneHot Encoder can result in high dimensionality and is not recommended for SVR. Label Encoder and Ordinal Encoder can be used with SVR, but they assume an ordinal relationship that may not be valid for categorical features. Alternative encoding methods such as target encoding or effect encoding might be more suitable for SVR to capture the impact of categorical features accurately.
In summary, the choice of encoder depends on the specific machine learning models used. OneHot Encoder is generally suitable for decision trees, random forests, and XGBoost, while Label Encoder and Ordinal Encoder may introduce implicit ordering that can be inappropriate for some models. KNN and SVR require careful consideration of encoding techniques to address high dimensionality and preserve the meaningful information within categorical features.
Sklearn's label encoder is used for the target variable, not for feature variables. Ordinal encoding is supposed to be used on categorical feature variables. Ordinal encoding is easier to use than writing a custom encoder using label encoding. Ordinal implies an underlying rank of values, although there might be no real underlying ranking. Ordinal encoding ranks alphabetically, which might not make sense.
SHAP values for OneHot encoded models can be aggregated by summing the individual SHAP values, according to: https://github.com/slundberg/shap/issues/397
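A numpy-only sketch of that aggregation, assuming one-hot columns are named with a `feature_category` convention (e.g. `brand_Audi`); the naming scheme is an assumption:

```python
import numpy as np

def aggregate_onehot_shap(shap_values: np.ndarray, feature_names, prefix: str):
    """Sum the SHAP columns that belong to one one-hot encoded feature,
    e.g. all columns named 'brand_...' back into a single 'brand' column."""
    idx = [i for i, name in enumerate(feature_names)
           if name.startswith(prefix + "_")]
    return shap_values[:, idx].sum(axis=1)
```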
We trained all models with both OneHot and ordinal encoding and found that there is little difference in most of their prediction performances.
Even KNN, which could be affected by the ranking introduced by the ordinal encoder, showed similar performance to the KNN model with OneHot encoding.
As expected, the ordinal encoding influenced the SVR model's performance immensely. We assume that this is due to the introduced ranking in the ordinal encoding.
We additionally introduced AdaBoost to replace the SVR model for ordinal encoding, because the SVR model was strongly affected by the differences in encoding.
As expected, computation times were greatly reduced with the introduction of ordinal encoding.
For an evaluation of the differences between OneHot encoding and ordinal encoding, refer to the Appendix.
Choosing the appropriate evaluation metrics is a critical step in assessing the performance of machine learning models. These metrics help us understand how well our models are performing and compare different models or algorithms against each other. In the context of our project, where we are predicting monthly leasing prices, we have selected several evaluation metrics to evaluate the quality of our models.
The Mean Squared Error (MSE) is a widely used metric that calculates the average squared difference between the predicted and actual values. It provides a measure of how close our predictions are to the true values, with lower values indicating better performance. The Root Mean Squared Error (RMSE) is derived from MSE by taking the square root of the average squared difference, which provides a more interpretable metric in the original scale of the target variable.
The Mean Absolute Error (MAE) is another commonly used metric that calculates the average absolute difference between the predicted and actual values. Like MSE, lower values of MAE indicate better model performance. MAE is less sensitive to outliers compared to MSE, making it a suitable choice when extreme values are present in the data.
The R-squared (R2) metric measures the proportion of variance in the target variable that is explained by the model. It ranges between 0 and 1, with higher values indicating a better fit. R2 is a valuable metric for assessing the overall goodness-of-fit of the model.
Additionally, we have chosen the Mean Absolute Percentage Error (MAPE or MAPR) as an evaluation metric. MAPE calculates the average percentage difference between the predicted and actual values, providing insights into the relative magnitude of errors. MAPE or MAPR is useful when we want to understand the accuracy of our predictions in relation to the actual values.
By utilizing these evaluation metrics, we can comprehensively evaluate the performance of our models and gain insights into their accuracy, precision, and generalization capabilities. This enables us to make informed decisions regarding model selection and fine-tuning to improve the predictive capabilities of our system.
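The metrics discussed above can be computed with scikit-learn; a minimal sketch:

```python
import numpy as np
from sklearn.metrics import (mean_squared_error, mean_absolute_error,
                             mean_absolute_percentage_error, r2_score)

def evaluate(y_true, y_pred):
    """Compute MSE, RMSE, MAE, R2, and MAPE for a set of predictions."""
    mse = mean_squared_error(y_true, y_pred)
    return {
        "MSE": mse,
        "RMSE": np.sqrt(mse),
        "MAE": mean_absolute_error(y_true, y_pred),
        "R2": r2_score(y_true, y_pred),
        "MAPE": mean_absolute_percentage_error(y_true, y_pred),
    }
```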
For evaluation on the train and test set, we chose the following measurements:
Using a scoring dictionary, the RandomizedSearchCV algorithm calculates all scoring values, although it only refits on the given "refit" parameter, for which we used MSE. We also tried MAE and MAPR, but this did not make a notable difference.
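Such a scoring dictionary might look like this sketch (the key names are illustrative; the string values are scikit-learn's built-in scorer names):

```python
# All scorers are evaluated during the search, but refitting uses only
# the metric named by the `refit` argument of RandomizedSearchCV.
scoring = {
    "MSE": "neg_mean_squared_error",
    "MAE": "neg_mean_absolute_error",
    "MAPE": "neg_mean_absolute_percentage_error",
    "R2": "r2",
}
refit_metric = "MSE"
```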
In this section, we focus on building a decision tree model for predicting vehicle leasing prices. Decision trees are powerful machine learning algorithms that can effectively handle both numerical and categorical data. They provide interpretable models that mimic the decision-making process, making them widely used and easily understandable.
To construct the decision tree model, we define a parameter distribution that includes various hyperparameters such as 'min_samples_split', 'min_samples_leaf', 'ccp_alpha', and 'random_state'. These hyperparameters control the behavior and complexity of the decision tree and need to be optimized to achieve the best performance.
We create a pipeline that includes a preprocessor for data transformation and a DecisionTreeRegressor as the main model. The preprocessor ensures that the data is properly prepared before being fed into the decision tree model.
To find the optimal combination of hyperparameters, we perform a randomized search with cross-validation using RandomizedSearchCV. This technique allows us to efficiently explore different hyperparameter settings and evaluate their impact on model performance.
After fitting the decision tree model to the training data, we evaluate its performance using various metrics on both the training and test sets. These metrics provide insights into the model's accuracy, precision, and generalization capabilities.
Additionally, we analyze the best hyperparameter values obtained from the randomized search and present them in a DataFrame. This information helps us understand the configuration of the decision tree model that yielded the best results.
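A minimal sketch of such a search, with a simplified pipeline and illustrative parameter ranges (not the notebook's actual configuration):

```python
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder
from sklearn.tree import DecisionTreeRegressor

# Hyperparameter distribution for the decision tree; ranges are assumptions.
param_distributions = {
    "tree__min_samples_split": [2, 5, 10],
    "tree__min_samples_leaf": [1, 2, 4],
    "tree__ccp_alpha": [0.0, 0.001, 0.01],
}

pipeline = Pipeline([
    ("encode", OrdinalEncoder()),
    ("tree", DecisionTreeRegressor(random_state=42)),
])

search = RandomizedSearchCV(
    pipeline, param_distributions, n_iter=5, cv=3,
    scoring="neg_mean_squared_error", random_state=42, n_jobs=-1,
)
```

After calling `search.fit(X_train, y_train)`, the best configuration is available via `search.best_params_`.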
The following section defines the final Decision Tree Regressor, or imports an existing one:
In this section, we focus on building a random forest model for predicting vehicle leasing prices. Random forest is an ensemble learning algorithm that combines multiple decision trees to make predictions. It is known for its robustness, ability to handle complex data, and resistance to overfitting.
We define a parameter distribution that includes hyperparameters such as 'n_estimators', 'max_depth', 'min_samples_split', 'min_samples_leaf', and 'random_state'. These hyperparameters control the behavior and complexity of the random forest model.
By performing a randomized search with cross-validation using RandomizedSearchCV, we explore different hyperparameter settings and evaluate their impact on model performance.
After fitting the random forest model to the training data, we evaluate its performance using various metrics on both the training and test sets. This helps us assess the model's accuracy and generalization capabilities.
The following section defines the final Random Forest Regressor, or imports an existing one:
In this section, we focus on building a K-Nearest Neighbors (KNN) model for predicting vehicle leasing prices. KNN is a non-parametric algorithm that uses the nearest neighbors in the training data to make predictions. It is known for its simplicity and versatility in handling different types of data.
We define a KNN pipeline that includes a preprocessor for data transformation and a KNeighborsRegressor as the main model.
To find the optimal combination of hyperparameters, we perform a randomized search with cross-validation using RandomizedSearchCV. The hyperparameters include 'n_neighbors', 'leaf_size', 'weights', and 'p', which control the number of neighbors, the leaf size of the tree, the weight function used in predictions, and the distance metric, respectively.
Although KNN could theoretically be sensitive to the introduced ranking, due to its reliance on distance-based neighbor grouping, it exhibited similar performance to the OneHot encoded KNN model. This suggests that the impact of the ranking on the KNN algorithm's predictions may be minimal in this particular context.
In this section, we focus on building a regression model using XGBoost (Extreme Gradient Boosting). XGBoost is a powerful machine learning algorithm known for its exceptional performance in various domains. It is an ensemble learning method that combines multiple decision trees to make accurate predictions. XGBoost incorporates gradient boosting techniques and introduces additional regularization to enhance model generalization and handle complex data patterns effectively.
To build the XGBoost model, we utilize a RandomizedSearchCV approach to search for the optimal combination of hyperparameters. These hyperparameters include the maximum depth of the trees, learning rate, number of estimators, gamma, subsample, colsample_bytree, min_child_weight, reg_lambda, reg_alpha, tree_method, and random_state. By performing cross-validation during the search, we ensure robust model evaluation and selection of hyperparameters that yield the best performance.
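A parameter distribution covering the hyperparameters listed above might look like the following sketch; the value ranges are illustrative assumptions:

```python
# Hypothetical hyperparameter distribution for the XGBoost search step in a
# pipeline named "xgb"; ranges are illustrative, not the notebook's values.
xgb_param_distributions = {
    "xgb__max_depth": [3, 5, 7, 9],
    "xgb__learning_rate": [0.01, 0.05, 0.1, 0.3],
    "xgb__n_estimators": [100, 300, 500],
    "xgb__gamma": [0, 0.1, 0.5],
    "xgb__subsample": [0.6, 0.8, 1.0],
    "xgb__colsample_bytree": [0.6, 0.8, 1.0],
    "xgb__min_child_weight": [1, 3, 5],
    "xgb__reg_lambda": [0.1, 1.0, 10.0],
    "xgb__reg_alpha": [0.0, 0.1, 1.0],
    "xgb__tree_method": ["hist"],
}
```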
In addition, XGBoost offers GPU acceleration, which increases computational performance and reduces model building time.
In this section, we focus on building a regression model using Support Vector Machines (SVM). SVM is a powerful algorithm that is widely used for regression tasks due to its ability to handle both linear and non-linear relationships in the data.
The SVM model is constructed using a pipeline that incorporates a preprocessor and an SVR (Support Vector Regression) regressor. The preprocessor handles data preprocessing steps, such as feature scaling and encoding, to ensure compatibility with the SVM model.
To find the optimal hyperparameters for the SVM model, we utilize RandomizedSearchCV. This technique performs a randomized search over a specified parameter distribution, allowing us to explore different combinations of hyperparameters efficiently. The hyperparameters we tune include 'C', which controls the regularization strength, 'kernel' for the choice of kernel function, and 'epsilon' that sets the margin of error allowed in the model.
SVM/SVR is a theoretically simple algorithm, but its computational complexity makes it a time-consuming process, often taking several hours to build. The time complexity of SVM is typically in the range of O(n^2) to O(n^3), where n is the number of training samples.
The 'linear' kernel of SVR could potentially outperform the 'rbf' and 'poly' kernels; however, due to its extreme computational inefficiency, we were not able to finish building a model this way. We therefore removed 'linear' from the kernel parameter list, which reduced computation times from over 10 hours to a mere 8 minutes.
Additionally, SVR is sensitive to the introduced ranking of categorical variables through Ordinal encoding, which resulted in very poor performance.
As expected, the SVM regression is strongly affected by the ordinal encoding; the results differ considerably from those obtained with OneHot encoding. See Encoding differences.
In the AdaBoost Regressor building section, we utilize the AdaBoost algorithm to train a regression model. AdaBoost stands for Adaptive Boosting and is a popular ensemble learning method that combines multiple weak learners to create a strong learner.
We start by defining a pipeline consisting of a preprocessor and the AdaBoostRegressor. The preprocessor handles the data preprocessing steps, such as feature scaling or encoding. The AdaBoostRegressor is the core component that performs the boosting algorithm.
We specify a parameter distribution that contains different hyperparameters for the AdaBoostRegressor, including the number of estimators, learning rate, loss function, base estimator (such as DecisionTreeRegressor or RandomForestRegressor), and random state.
Next, we perform randomized search cross-validation using RandomizedSearchCV to explore different combinations of hyperparameters.
In this section, we compare the performances of our different machine learning models on the test data. Evaluating the models on unseen data is crucial to assess their generalization capabilities and determine their effectiveness in making predictions.
In this section, we analyze the performance of our machine learning models on out-of-sample data. Out-of-sample data refers to data that was not used during the model training and evaluation process. Evaluating the models on out-of-sample data provides a more realistic assessment of their performance and helps us understand how well they can generalize to new, unseen instances.
By examining the performance metrics on the out-of-sample data, we can determine how well our models are likely to perform in real-world scenarios. This evaluation allows us to validate the models' effectiveness, identify any potential issues or limitations, and make informed decisions about their deployment.
In this section, we delve into evaluating the importance of features in our machine learning models. Understanding the significance of different features can provide valuable insights into their impact on the prediction outcome. We use two methods for feature importance assessment: the SHAP (SHapley Additive exPlanations) library and the built-in feature importance functions.
The SHAP library offers a powerful tool for explaining individual predictions by quantifying the contribution of each feature. It provides a comprehensive view of feature importance by considering all possible feature combinations and their respective contributions. Additionally, we utilize the built-in feature importance functions provided by the selected machine learning models. These functions calculate the relevance of features based on various metrics specific to each algorithm.
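The built-in route can be sketched on a small tree model as below; the feature names are hypothetical stand-ins for the dataset's columns. The SHAP route would instead wrap the fitted model in `shap.TreeExplainer` and average the absolute SHAP values per feature.

```python
# Sketch of the built-in feature-importance route on a small tree model;
# the feature names are hypothetical, not the project's actual columns.
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=300, n_features=4, noise=0.1, random_state=0)
feature_names = ["Horsepower", "Mileage", "Duration", "Emission"]  # illustrative

model = DecisionTreeRegressor(max_depth=5, random_state=0).fit(X, y)

# feature_importances_ reflects the total impurity reduction each feature
# achieves across all splits, normalized to sum to 1
for name, score in sorted(zip(feature_names, model.feature_importances_),
                          key=lambda t: -t[1]):
    print(f"{name}: {score:.3f}")
```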
We have two interpretations of feature importance from a Decision Tree model: one is based on Mean SHapley Additive exPlanations (SHAP) values, and the other is based on the inbuilt feature importance of the model. Both interpretations reveal insights into how features contribute to the model's predictive performance.
'Horsepower' is deemed the most influential feature in both interpretations. With a SHAP value of over 160 and a feature importance score of about 0.7, it is clear that changes in 'Horsepower' significantly impact the model's predictions. Therefore, 'Horsepower' is a crucial feature for the decision-making process of this model.
The 'Registration' and 'Model' features are identified as the second most important variables, but in different interpretations. 'Registration' has a significant impact according to SHAP values, while 'Model' stands out in the inbuilt feature importance measure. This disparity may be due to the different ways these metrics calculate importance.
'Mileage' and 'Duration' both have comparable importance levels according to SHAP values and the inbuilt feature importance, with values around 30 and 0.06 respectively. This consistency suggests that while these features play a role in the model's decisions, their impact is less substantial compared to 'Horsepower' and 'Registration' or 'Model'.
Lastly, 'Emission' and 'Consumption' have been identified as having negligible influence in both interpretations. Their low SHAP values and feature importance scores suggest that these features contribute minimally to the model's predictive ability.
In summary, 'Horsepower' is the key feature in this model, followed by 'Registration' or 'Model', and then 'Mileage' and 'Duration'. The features 'Emission' and 'Consumption' have little to no impact on the model's decision-making process, indicating potential for simplifying the model without significantly impacting its accuracy. These interpretations can guide feature selection and engineering in future model iterations, and remind us that different feature importance methods may yield different perspectives.
From both the Mean SHAP values and the inbuilt feature importance of the Random Forest model, we observe similar patterns:
'Horsepower' is the most influential feature, with high SHAP values (~165) and importance score (~0.7), making it crucial for the model's decision-making.
'Registration' and 'Model' are secondary in importance. The SHAP values highlight 'Registration' more (~40), while the inbuilt importance emphasizes 'Model' more (~0.09).
'Mileage' is similarly impactful in both measures (~35 SHAP, ~0.06 importance), indicating its moderate contribution.
Lastly, 'Emission' and 'Consumption' are negligible in both interpretations, indicating their minimal impact on the model's predictive ability.
The interpretations for both Mean SHAP values and the built-in feature importance for the XGBoost model are as follows:
In the SHAP interpretation, 'Horsepower' is the most significant feature (~120), followed by 'Gear' (~40), 'Model' (~30), and 'Registration' (~25). 'Emission' and 'Consumption' are not significant.
The built-in feature importance of XGBoost, often determined by F-score (a measure of how frequently each feature appears in the model splits), indicates a similar importance of 'Horsepower' (~0.35) and 'Gear' (~0.3), followed by 'Duration' (~0.16). Again, 'Emission' and 'Consumption' aren't significant.
So, in both interpretations, 'Horsepower' is paramount, 'Gear' is important, while 'Emission' and 'Consumption' have little influence. The SHAP values emphasize the 'Model' and 'Registration' features, while the built-in importance underscores 'Duration'.
The method to compute this feature importance is through an F-score, which essentially measures how frequently each feature appears in the models created during the boosting process.
In XGBoost, each decision tree is built by repeatedly splitting the data into two groups. Each split involves a single feature at a time. The more frequently a feature is used in making splits across all trees, the higher its F-score, and thus the more important it is considered to be. This is because a feature that is often used for splitting is one that does a good job of separating the data, thereby improving the model's performance.
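The split-counting idea behind the F-score can be illustrated on a single scikit-learn tree by counting how often each feature index appears in the tree's split nodes. This is a simplified stand-in: XGBoost itself aggregates these counts over all boosted trees, retrievable via `model.get_booster().get_score(importance_type="weight")`.

```python
# Illustration of the split-counting idea behind XGBoost's F-score, using one
# scikit-learn tree: count how often each feature is used in a split node.
from collections import Counter
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=300, n_features=3, noise=0.1, random_state=0)

tree = DecisionTreeRegressor(max_depth=4, random_state=0).fit(X, y)

# tree_.feature holds the feature index used at each node; leaves are marked -2
split_counts = Counter(int(f) for f in tree.tree_.feature if f >= 0)
print(dict(split_counts))  # feature index -> number of splits using it
```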
In our XGBoost model, 'Horsepower' has the highest built-in feature importance, followed by 'Gear' and then 'Duration'. This means that these three features are the ones most often used to split the data, and thus they have the most significant impact on the model's predictions. Conversely, 'Emission' and 'Consumption' are not important, meaning they are rarely used in data splits and have little effect on the predictions.
It's worth noting that while built-in feature importance gives us a good indication of which features are most useful for making predictions, it doesn't tell us anything about the nature of the relationships between these features and the target variable.
The interpretations for both Mean SHAP values and the built-in feature importance for the AdaBoost model are as follows:
SHAP values indicate 'Horsepower' as the most significant feature (~160), followed by 'Model' (just below 40), 'Registration' (closely behind 'Model'), and 'Mileage' (~35). 'Emission' and 'Consumption' don't hold significant importance.
The built-in feature importance of AdaBoost, computed based on the weight of the evidence each feature provides across all the decision stumps, shows 'Horsepower' with the highest importance (just below 0.7). 'Model' follows (just below 0.15), and then 'Mileage' and 'Registration' (both ~0.05). Again, 'Emission' and 'Consumption' aren't significant.
The built-in feature importance in AdaBoost is computed based on the contribution of each feature to the weighted error rate of the model. In AdaBoost, each feature is used as a decision stump, and an importance score is calculated for each feature based on how much it decreases the weighted error of the model. The more a feature decreases this weighted error, the more important it is considered to be.
In conclusion, 'Horsepower', 'Model', 'Registration', and 'Mileage' are key features in the AdaBoost model according to both SHAP and built-in feature importance, with 'Emission' and 'Consumption' providing little influence. However, the SHAP and built-in feature importance differ slightly in the relative importance they assign to 'Model', 'Registration', and 'Mileage'.
The scikit-learn version must be >1.2.2; this in turn requires Python >3.8.
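A quick way to verify both requirements in the notebook is a small version check, sketched here:

```python
# Check the version requirements stated above: scikit-learn > 1.2.2, Python > 3.8.
import sys
import sklearn

print("Python:", ".".join(map(str, sys.version_info[:3])))
print("scikit-learn:", sklearn.__version__)

assert sys.version_info >= (3, 8), "Python 3.8+ is required"
```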
This work draws inspiration from the master thesis conducted by Thomas Dornigg. To delve deeper into Dornigg's thesis, please refer to the following link:
As we progressed in the creation of this machine learning notebook, we shifted from OneHot encoding to ordinal encoding. Notably, while some algorithms, like tree-based models, demonstrate adaptability to the choice of encoding, others show heightened sensitivity. In particular, Support Vector Machine (SVM) and Support Vector Regression (SVR) algorithms can be influenced by the biases initiated by varying encoding techniques. This sensitivity emanates from the dependency of SVM and SVR on vector geometry, where actual geometric distances between data points significantly impact their computational process.
Before switching to ordinal encoding, we used OneHot encoding, which changed the results of the feature importance analysis. In addition, OneHot encoding increases the complexity of a model immensely, because it uses n-1 columns for a categorical feature with n distinct values.
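The dimensionality difference is easy to demonstrate: with `drop='first'`, OneHot encoding produces n-1 columns for n categories, while ordinal encoding always produces a single column. The category values below are illustrative.

```python
# Small illustration of the column counts: OneHot with drop='first' yields
# n-1 columns for n categories, ordinal encoding yields one column.
import numpy as np
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

fuel = np.array([["Petrol"], ["Diesel"], ["Electric"], ["Hybrid"]])  # n = 4 categories

onehot = OneHotEncoder(drop="first").fit_transform(fuel).toarray()
ordinal = OrdinalEncoder().fit_transform(fuel)

print(onehot.shape)   # (4, 3) -> n-1 columns
print(ordinal.shape)  # (4, 1) -> one column
```

With many categorical features of high cardinality, this multiplication of columns is what drives up model complexity and training time under OneHot encoding.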
With OneHot encoding, the feature importance results indicated that the model of the car was not very significant. We therefore tried to build models with reduced complexity, the light models.
This section shows the building and evaluation of the light models and also shows the effect of leaving out the models in the model building process.
TODO:
- Drop light models
- Mention in appendix as tests
- Mention OneHot in appendix